Latent variables and state-space models

Jasper Slingsby

Latent variables & state-space models

The iterative ecological forecasting cycle in the context of the scientific method, demonstrating how we stand to learn from making iterative forecasts.

  • Iterative ecological forecasting typically requires modelling variables that vary in time and space…


  • State-space models are a general statistical framework that is particularly well-suited to this problem.


  • Here I provide a brief introduction to state-space models, but first I must introduce latent variables.


Latent variables

Variables are latent if they are unobserved or estimated with uncertainty. Their true value is not known and can only be inferred indirectly through a model from other variables that can be directly observed or measured.

Dietze (2017) outlines four common latent variable models:

  1. observation errors (random and systematic)
  2. proxy data
  3. missing data
  4. unobserved variables

Observation error

I’ve already mentioned that a big challenge to modelling is error in the observation of the state variable of interest.


Observation errors are typically either:

  • random, due to imprecision of the data collection process or other extraneous factors, or

  • systematic, implying there is some bias

Precision vs accuracy

The true value is the origin (0,0).


Imprecision in measurement creates random error.


Inaccuracy creates systematic error.

Random observation error

Random error is most commonly created by imprecision in measurement (“scatter”) around the true state of the variable of interest, but can be created by other processes.


In this case we may want to model the true state as a latent variable, and model the random observation error as a probability distribution (typically Gaussian) with mean of 0 (or the mean of the process model).


E.g. specifying the data model to be a normal distribution, as we did in the post-fire recovery model:

\[\begin{gather} NDVI_{i,t}\sim \mathit{N}(\mu_{i,t},\frac{1}{\sqrt{\tau}}) \\ \end{gather}\]
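As a quick sketch of this idea (simulated toy values, not the actual post-fire data; the precision \(\tau\) is an assumed number), we can generate observations of a latent state with Gaussian error and confirm that the random error averages out:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latent true state: NDVI recovering after fire (toy values).
true_ndvi = np.linspace(0.2, 0.7, 50)

# Data model as on the slide: y ~ N(mu, 1/sqrt(tau)), with an assumed tau.
tau = 400.0                              # precision; sd = 1/sqrt(400) = 0.05
sd = 1.0 / np.sqrt(tau)
observed = true_ndvi + rng.normal(0.0, sd, size=true_ndvi.size)

# Random error has mean zero, so residuals average out near zero.
mean_error = float(np.mean(observed - true_ndvi))
print(abs(mean_error) < 3 * sd / np.sqrt(true_ndvi.size))
```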

Systematic observation error

Systematic error is where there is a bias, such as created by differences among observers or poor instrument calibration.


Constant bias can be corrected with an offset, but something like sensor drift may need to be approximated as a random walk or similar (to account for temporal autocorrelation).


If we have more information about the causes of error, we can apply more complex observation models (e.g. differences among field staff, etc).


Often there is both random and systematic error, requiring a model that accounts for both.
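A toy contrast between the two cases (all numbers assumed): a constant bias is removed by subtracting an offset, while random-walk drift accumulates over time and cannot be fixed by any single offset:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
true_state = np.full(n, 10.0)                 # constant true state (toy)

# Constant bias: a single offset correction recovers the true state.
bias = 0.5
obs_const = true_state + bias + rng.normal(0, 0.1, n)
corrected = obs_const - bias

# Sensor drift: bias accumulates as a random walk, so it is
# temporally autocorrelated and must itself be modelled.
drift = np.cumsum(rng.normal(0, 0.05, n))
obs_drift = true_state + drift + rng.normal(0, 0.1, n)

print(round(float(np.mean(corrected - true_state)), 1))
```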

Proxy data

I.e. observing a proxy for the state variable of interest, e.g.

  • Normalised Difference Vegetation Index (NDVI) for plant cover or vegetation health
  • Time-domain reflectometry (TDR) for soil moisture
  • Dung as a measure of herbivore habitat preference

There are many ways to relate the observed proxy(ies) to the latent state variable of interest, such as empirical calibration curves, probabilities of identifying dung correctly, etc.
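For instance, an empirical calibration curve can be sketched like this (hypothetical TDR and soil-moisture numbers, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical calibration data: TDR readings paired with direct
# (gravimetric) soil moisture measurements. Numbers are made up.
soil_moisture = rng.uniform(0.05, 0.40, 30)             # directly measured
tdr_reading = 2.0 * soil_moisture + 0.1 + rng.normal(0, 0.01, 30)

# Calibration curve: regress the direct measurements on the proxy,
# then convert new proxy readings into latent-variable estimates.
slope, intercept = np.polyfit(tdr_reading, soil_moisture, 1)

new_reading = 0.6                                       # a fresh TDR value
estimate = slope * new_reading + intercept              # inferred moisture
print(round(float(estimate), 2))
```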

Missing data


When some observations are missing from the data, they can be estimated with uncertainty in various ways.


Missing data are common in time series or in space (e.g. sensor failure, logistical difficulties, etc.).

Unobserved variables

Some variables may never be observed (e.g. too difficult to measure), but can be inferred from the process model, e.g.:

  • soil stored seedbanks in plant demographic models
  • determinants of resource allocation decisions in organisms


Estimating these latent variables can be tricky, but having multiple independent measures to constrain the estimates or high confidence in the model structure (i.e. mechanistic understanding) can help.

State-space models

Forecasting involves predicting key variables further in time, and often farther in space.


It usually has to deal with a number of latent variables due to missing or sparse data, observation error, etc, and these are often connected in time and/or space (i.e. autocorrelated).


State-space models are a useful framework for dealing with these kinds of problems and for forecasting in general.

State-space models

For time series of discrete state variables (i.e. a categorical response) they are also referred to as Hidden Markov Models

  • “hidden” refers to the latent variable(s)
  • “Markov” refers to their recursive nature, with the next state in time a function of the current state only

When extended to spatial or space-time models they are called Markov random fields

State-space models

The name comes from the focus on estimating latent state variables.

In doing so, they explicitly separate observation errors from process errors.

Autocorrelation…

State-space models

Illustration of a simple univariate state-space model from Auger-Methe et al. 2021.

  1. Once the dependence of the observations \(y_t\) on the states \(z_t\) is accounted for, the observations are assumed to be independent.

  2. A toy model demonstrating that the state estimates (mean and 95% confidence interval - black line and grey band respectively) can be a closer approximation of the true states (red dots) than the observations (blue dots).
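A minimal simulated analogue (assumed noise levels; a crude moving-average smoother stands in for the proper filtering/smoothing a real state-space analysis would use) shows why state estimates can beat the raw observations: observation errors are independent, so pooling neighbouring observations cancels much of the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Process model: latent state follows a random walk (assumed sd values).
sigma_proc, sigma_obs = 0.1, 0.5
z = np.cumsum(rng.normal(0, sigma_proc, n))       # latent true states

# Observation model: each y_t is the state plus independent error.
y = z + rng.normal(0, sigma_obs, n)

# Crude stand-in for a smoother: a 9-point moving average of y.
k = 9
smooth = np.convolve(y, np.ones(k) / k, mode="same")

# Compare errors away from the edges of the series.
rmse_raw = float(np.sqrt(np.mean((y[k:-k] - z[k:-k]) ** 2)))
rmse_smooth = float(np.sqrt(np.mean((smooth[k:-k] - z[k:-k]) ** 2)))
print(rmse_smooth < rmse_raw)
```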

But first…

Before I can introduce Bayes, there are a few basic building blocks we need to establish first.

  1. How the method of Least Squares works, and its limitations, especially its inflexible, implicit data model.
  2. The concept of likelihood and the method of maximum likelihood estimation
    • this is a major component of Bayes’ Theorem
    • it’s a frequentist method, but has advantage 1 and most of advantage 2 above
      • in fact, there are frequentist approaches that achieve most of the advantages listed on the previous slide, but they are typically cumbersome. With Bayes you can do it all without much extra work.

Least Squares

Traditional parametric statistics like regression analysis and analysis of variance (ANOVA) rely on Ordinary Least Squares (OLS).


There are other flavours of least squares that allow more flexibility (e.g. nonlinear least squares (NLS), which we used in the practical; partial least squares, etc), but I’m not going to go into these distinctions.

Least Squares

In general, least squares approaches “fit” (i.e. estimate the parameters of) models by minimizing the sums of the squared residuals.

The model (blue line) is drawn through the points to minimize the sum of the squared vertical (y-axis) differences between each point and the regression line (i.e. residuals).

Redrawn highlighting the residuals:

  • The residuals are the grey lines (lollypop sticks) linking each observed data point to the values predicted by the model (open circles along the regression line)
  • They are vertical and not orthogonal to the regression line, because they represent the variance in Y (Reward) that is not explained by X (Effort)
  • There is no scatter in the predicted values (open circles), because the scatter is residual variance that the model cannot account for and predict.
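This procedure can be sketched numerically (toy Effort/Reward data, values invented): the OLS line minimizes the sum of squared vertical residuals, and any other line does strictly worse:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy Effort/Reward data echoing the figure (values are made up).
effort = np.linspace(0, 10, 40)
reward = 2.0 + 1.5 * effort + rng.normal(0, 1.0, 40)

# OLS fit: the slope and intercept minimizing the sum of squared residuals.
slope, intercept = np.polyfit(effort, reward, 1)
predicted = intercept + slope * effort
residuals = reward - predicted            # the vertical "lollypop sticks"

ssr_ols = float(np.sum(residuals ** 2))

# Perturbing the slope increases the sum of squared residuals.
ssr_other = float(np.sum((reward - (intercept + (slope + 0.1) * effort)) ** 2))
print(ssr_ols < ssr_other)
```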

A histogram of the residuals:

The residuals approximate a normal distribution.

This is a key assumption when using Least Squares

  • together with constant residual variance, termed “homoscedasticity” or “homogeneity of variance”
  • For minimizing the sums of squares to work, a unit change in the residuals should be scale invariant
    • i.e. the difference between 1 and 2 and between 101 and 102 must count the same
    • This is only true when the residuals are normally distributed with constant variance (versus log-scale, for example)
    • If not, minimizing the sums of squares does not work (and people often log transform their data to get them onto an invariant scale).
  • If violated, consider a different process model (e.g. nonlinear), or another statistical approach (MLE or Bayes).

The shortcomings of Least Squares:


1. Least Squares doesn’t explicitly include a data model

It’s useful at this stage to make a distinction between data models and process models.

  • The process model is the bit you’ll be used to, where we describe how the model creates a prediction for a particular set of inputs or covariates (e.g. a linear model)

  • The data model describes the residuals (i.e. mismatch between the process model and the data)

    • also often called the data observation process

Least Squares analyses don’t explicitly include a data model, because minimizing the sums of squares means that the data model can only ever be a normal distribution (i.e. there must be homogeneity of variance).

Why is this a limitation?

  1. in reality, the data model can take many forms
    • e.g. Binomial coin flips, Poisson counts of individuals, Exponential waiting times, etc
    • this is where Maximum Likelihood comes into its own
      • in the practical, we specify two functions for the MLE analyses:
        • the process model (pred.negexp/S)
        • the likelihood function (fit.negexp/S.MLE), which includes the data model
  2. sometimes one would like to include additional information in the data model
    • e.g. sampling variability (e.g. different observers or instruments), measurement errors, instrument calibration, proxy data, unequal variances, missing data, etc
    • this is where Bayesian models come into their own

The shortcomings of Least Squares:


2. Least squares focuses on what the parameters are not, rather than what they are

  • Least Squares focuses on null hypothesis testing - the ability to reject (or fail to reject) the null hypothesis at some threshold of significance (alpha, usually 0.05)

  • For a linear model, the null hypothesis is that the slope = 0 (i.e. no effect of X on Y)

  • A linear model is only considered useful when you can reject the null hypothesis.

    • This just tells you that your data would be improbable if the parameter of interest (the slope in this case) were 0…
  • It tells you nothing about the probability of the parameters actually being the estimates you arrived at by minimizing the sums of squares!

Maximum likelihood

The likelihood principle = a parameter value is more likely than another if it is the one for which the data are more probable.

Maximum Likelihood Estimation (MLE) = a method for estimating model parameters by applying the likelihood principle. It optimizes parameter values to maximize the likelihood that the process described by the model produced the data observed.

The probability of a data point (dotted line) being generated under two alternative hypotheses (sets of parameter values). Here H1 is more likely, because the probability density at \(x\) = 1 is higher for H1 than H2 (roughly 0.25 vs 0.025 = 10 times more likely). Usually there’d be a continuum of hypotheses (parameter values) to select from, and there’d be far more data points.
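The figure’s comparison can be reproduced approximately in code (the means below are invented so the densities land near the figure’s ~0.25 and ~0.025):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    """Probability density of x under a Normal(mu, sd) distribution."""
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

# One datum and two hypotheses, i.e. two candidate parameter sets.
x = 1.0
like_h1 = normal_pdf(x, mu=0.0, sd=1.0)    # density ~0.24 at x = 1
like_h2 = normal_pdf(x, mu=3.35, sd=1.0)   # density ~0.025 at x = 1

# H1 is more likely: the datum is roughly 10x more probable under it.
print(like_h1 > like_h2)
```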

Maximum likelihood

Viewed differently…

With MLE we assume we have the correct model and use MLE to choose parameter values that maximize the conditional probability of the data given those parameter values, \(P(Data|Parameters)\).

  • This leaves us knowing the likelihood of the best set of parameters and the conditional probability of the data given those parameters \(P(Data|Parameters)\)
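A minimal MLE sketch (toy data; a grid search over one parameter rather than a proper optimizer) makes the “maximize \(P(Data|Parameters)\)” step concrete:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(3.0, 1.0, 500)     # toy data; true mean = 3, sd = 1

# Log-likelihood of the data for candidate means (sd assumed known = 1):
# log P(Data | mu) = sum_i log Normal(x_i | mu, 1).
mu_grid = np.linspace(0.0, 6.0, 601)
loglik = np.array([
    np.sum(-0.5 * (data - mu) ** 2 - 0.5 * np.log(2 * np.pi))
    for mu in mu_grid
])

# The MLE is the candidate maximizing the (log-)likelihood; for a
# Gaussian with known sd this coincides with the sample mean.
mu_hat = float(mu_grid[np.argmax(loglik)])
print(abs(mu_hat - float(np.mean(data))) < 0.01)
```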

BUT!

What we really want to know is the conditional probability of the parameters given the data, \(P(Parameters|Data)\), because this allows us to express uncertainty in the parameter estimates as probabilities.

  • To get there, we apply a little probability theory, with a surprisingly useful byproduct!

First, let’s brush up on joint, conditional and marginal probabilities

The interactions between two random variables are illustrated below…

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

Joint probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

  • The joint probability, \(P(x,y)\), is the probability of both \(x\) and \(y\) occurring simultaneously

  • This is the probability of occurring within the overlap of the circles = 3/10

  • Needless to say, the joint probability \(P(y,x)\) is identical to \(P(x,y)\) (= 3/10)

Conditional probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

We can also define two conditional probabilities:

  • the probability of \(x\) given \(y\), \(P(x|y)\)
  • the probability of \(y\) given \(x\), \(P(y|x)\)

In the diagram these would be:

  • the probability of being in set \(x\) given we are only considering points in set \(y\) (= 3/6)
  • the probability of being in set \(y\) given we are only considering points in set \(x\) (= 3/7)

Marginal probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

Last, we can define two marginal probabilities for \(x\) and \(y\).

These are just the separate probabilities of being in set \(x\) or being in set \(y\) given the full set of events, i.e.

  • \(P(x)\) = 7/10 and
  • \(P(y)\) = 6/10

Joint, conditional and marginal probabilities

We can also show that the joint probabilities are the product of the conditional and marginal probabilities:

\[P(x,y) = P(x|y) P(y) = 3/6 * 6/10 = 0.3\]

and:

\[P(y,x) = P(y|x) P(x) = 3/7 * 7/10 = 0.3\]
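The counts from the Venn diagram make this easy to verify in a few lines:

```python
# Counts from the Venn diagram: 10 events, 7 in x, 6 in y, 3 in both.
n_total, n_x, n_y, n_xy = 10, 7, 6, 3

p_xy = n_xy / n_total           # joint P(x, y) = 3/10
p_x = n_x / n_total             # marginal P(x) = 7/10
p_y = n_y / n_total             # marginal P(y) = 6/10
p_x_given_y = n_xy / n_y        # conditional P(x|y) = 3/6
p_y_given_x = n_xy / n_x        # conditional P(y|x) = 3/7

# Joint = conditional * marginal, whichever way you factor it.
print(abs(p_xy - p_x_given_y * p_y) < 1e-12,
      abs(p_xy - p_y_given_x * p_x) < 1e-12)
```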

Tadaa!

This means we can derive what we want to know, \(P(Parameters|Data)\), as a function of the information maximum likelihood estimation provides, \(P(Data|Parameters)\).

…Let’s do the derivation…

Bayes’ Theorem

Now let:

  • parameters = \(\theta\)
  • the data = \(D\)


We’re interested in the conditional probability of the parameters given the data, \(p(\theta|D)\).


To get this, we need to take the previous equations and solve for \(p(\theta|D)\).

Since we know the joint probabilities are identical:

\[p(\theta,D) = p(\theta|D)p(D)\]

\[p(D,\theta) = p(D|\theta)p(\theta)\]

We can take the right hand side of the two:

\[p(\theta|D)p(D) = p(D|\theta)p(\theta)\]

and solve for \(p(\theta|D)\) as:

\[p(\theta|D) = \frac{p(D|\theta) \; p(\theta)}{p(D)} \;\;\]

which is known as Bayes’ Theorem!

Bayes’ Theorem

Rewriting the terms on one line allows us to label them with their names:

\[ \underbrace{p(\theta|D)}_\text{posterior} \; = \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; / \; \underbrace{p(D)}_\text{evidence} \]
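A toy grid approximation (made-up binomial example: 7 successes in 10 trials, flat prior) shows all four terms at work, with the evidence normalizing the posterior into a proper probability distribution:

```python
import numpy as np
from math import comb

# Toy data: 7 successes in 10 trials; theta = success probability.
k, n = 7, 10
theta = np.linspace(0.001, 0.999, 999)     # grid of parameter values

prior = np.ones_like(theta) / theta.size                    # flat prior
likelihood = comb(n, k) * theta**k * (1 - theta)**(n - k)   # p(D|theta)

# Evidence p(D): the data's total probability, summed over the grid.
evidence = float(np.sum(likelihood * prior))

# Bayes' Theorem: the posterior, normalized so it sums to 1.
posterior = likelihood * prior / evidence
print(round(float(np.sum(posterior)), 6))
```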

The evidence

The marginal probability of the data, \(p(D)\), is now called the evidence for the model, and represents the overall probability of the data according to the model.

  • having \(p(D)\) in the denominator means it normalizes the numerators to ensure that the updated probabilities sum to 1 over all possible parameter values - i.e. it ensures that the posterior, \(p(\theta|D)\), is expressed as a probability distribution
  • \(p(D)\) can otherwise mostly be ignored and is often left out, resulting in…

Bayes’ Rule

Getting rid of the evidence allows us to focus on the important bits:

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

Which reads “The posterior is proportional to the likelihood times the prior”.

This leaves us with three terms:


The likelihood

\(p(D|\theta)\) still represents the probability of the data given the model with parameter values \(\theta\), and is used in analyses to find the likelihood profiles of the parameters.

Bayes’ Rule

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

The posterior

  • What we have been calling the conditional probability of the parameters given the data, \(p(\theta|D)\), is now called the posterior
  • It gives us a probability distribution for the values any parameter can take, allowing us to represent uncertainty in the model and forecasts as probabilities, which is the holy grail we were after!!!
    • i.e. the posteriors are the distributions of the parameters of interest that we forecast, and we can now state the probability of our forecast being correct

Bayes’ Rule

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

The prior

  • The marginal probability of the parameters, \(p(\theta)\), is now called the prior.
  • While \(p(\theta)\) is almost an unexpected consequence of having solved the equation for \(p(\theta|D)\), the prior is where a lot of the magic of Bayes lies!
  • Firstly, we know that \(p(\theta)\) is a necessary requirement for us to be able to represent the posterior, \(p(\theta|D)\), as a probability distribution (what we were looking for).
  • But, it also represents the credibility of the parameter values, \(\theta\), without the data, \(D\).
    • What does this mean?

The prior

The prior represents the credibility of the parameter values, \(\theta\), without the data, \(D\).

“But how can we know anything about the parameter values without the data?”


By applying the scientific method, whereby we interrogate new evidence (the data) in the context of previous knowledge or information to update our understanding.

  • \(p(\theta)\) is the prior belief of what the parameters should be, before interrogating the data
    • i.e. our “context of previous knowledge or information”
  • if we don’t have much previous knowledge, we can specify the prior to represent this
  • the prior is incredibly powerful! But the Peter Parker Principle applies - “With great power comes great responsibility!”

The power of the prior

  1. It allows Bayesian analyses to be iterative. The posterior from one analysis can become the prior for the next!
    • This provides a formal probabilistic framework for the scientific method!
      • New evidence must be considered in the context of previous knowledge, providing the opportunity to update our beliefs.
    • This is ideal for iterative forecasting, because the previous forecast can be the prior for the next.
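This iterative property can be shown with the standard Beta-Binomial conjugate pair (toy success counts, not from the lecture): updating in batches, with each posterior becoming the next prior, gives exactly the same answer as analysing all the data at once:

```python
def update(a, b, k, n):
    """Beta(a, b) prior + k successes in n trials -> Beta posterior params."""
    return a + k, b + n - k

# Iterative route: two batches, posterior of batch 1 = prior for batch 2.
a, b = 1, 1                        # vague Beta(1, 1) prior
a, b = update(a, b, k=4, n=10)     # first field season (toy numbers)
a, b = update(a, b, k=9, n=15)     # second season, using the updated prior

# One-shot route: all the data at once under the original prior.
a_all, b_all = update(1, 1, k=13, n=25)

print((a, b) == (a_all, b_all))
```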

The power of the prior

  2. It makes Bayesian models incredibly flexible, allowing us to specify complex hierarchical models in a completely probabilistic framework relatively intuitively.
    • This is fantastic for “fusing” disparate data sources!
      • E.g. we may not have “prior beliefs” about a parameter’s values, but we may know something about another parameter or process that influences it. If so, we can specify a prior on our prior (called a hyperprior).
    • We can also specify separate or linked priors (or hyperpriors) for different parameters (or priors).

The prior comes with great responsibility…

It is very easy to specify an inappropriate prior and bias the outcome of your analysis!

  • First and foremost, as tempting as it may be, the one place the prior CANNOT come from is the data used in the analysis!!!
  • Second, while many people favour specifying “uninformative” priors, it is often incredibly difficult to know what “uninformative” is for different models or parameters.

We’ll get stuck into examples showing the value of Bayes in ecology in the next lecture.

References